Identifying unique spectral fingerprints in cough sounds for diagnosing respiratory ailments

Coughing, a prevalent symptom of many illnesses, including COVID-19, has led researchers to explore the potential of cough sound signals for cost-effective disease diagnosis. Traditional diagnostic methods, which can be expensive and require specialized personnel, contrast with the more accessible smartphone analysis of coughs. Typically, coughs are classified as wet or dry based on their phase duration. However, the utilization of acoustic analysis for diagnostic purposes is not widespread. Our study examined cough sounds from 1183 COVID-19-positive patients and compared them with 341 non-COVID-19 cough samples, as well as analyzing distinctions between pneumonia and asthma-related coughs. After rigorous optimization across frequency ranges, specific frequency bands were found to correlate with each respiratory ailment. Statistical separability tests validated these findings, and machine learning algorithms, including linear discriminant analysis and k-nearest neighbors classifiers, were employed to confirm the presence of distinct frequency bands in the cough signal power spectrum associated with particular diseases. The identification of these acoustic signatures in cough sounds holds the potential to transform the classification and diagnosis of respiratory diseases, offering an affordable and widely accessible healthcare tool.


Spectral analysis
Power spectral density (PSD) is given as a frequency-domain plot [33] showing the strength of the fluctuation (energy) as a function of frequency. Given an audio time series x(t) and its Fourier transform f(q), the average power of the signal and the PSD are defined as in [34]. The Python SciPy library [35] provides implementations of the Welch and periodogram methods [36,37], which enabled us to estimate the PSD using the default parameters (such as the Hann window) on audio data that were sampled at 22.05 kHz and were approximately 10 ms in duration each. We then used the PSD values to compare the COVID-19-positive and COVID-19-negative coughs. For this, we generated PSD plots from the cough sound signals, not only to compare COVID-19-positive and COVID-19-negative coughs but also to isolate the asthma and pneumonia subgroups of COVID-19. We then used the PSD plots to compare asthma and pneumonia with COVID-19.
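As an illustration of the PSD estimation step, the following minimal sketch applies SciPy's `welch` function with its default settings to a synthetic signal. The synthetic tone, the helper name `estimate_psd`, and the 0.5 s duration are assumptions made for the example and are not taken from the study; real cough segments would replace `segment`.

```python
import numpy as np
from scipy.signal import welch

FS = 22050  # sampling rate in Hz, as reported for the dataset


def estimate_psd(signal, fs=FS):
    """Return (frequencies, PSD) from Welch's method with SciPy defaults."""
    # SciPy's defaults: Hann window, 256-sample segments, 50% overlap.
    return welch(signal, fs=fs, window="hann")


# Hypothetical stand-in for a cough segment: a 250 Hz tone plus weak noise.
rng = np.random.default_rng(0)
t = np.arange(0, 0.5, 1 / FS)
segment = np.sin(2 * np.pi * 250 * t) + 0.1 * rng.standard_normal(t.size)

freqs, psd = estimate_psd(segment)
peak_hz = freqs[np.argmax(psd)]  # frequency bin carrying the most power
```

With the default 256-sample segments, the frequency resolution at 22,050 Hz is about 86 Hz, so the reported peak is only accurate to the nearest bin; a longer segment length would sharpen it at the cost of a noisier estimate.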

Separability test
For a given frequency band $(F_1, F_2)$, we calculated the relative power of the cough sound signals for the groups as follows:
• Each audio file was passed through a band-pass Butterworth filter (with a roll-off of 6 dB/octave) to retrieve the signal in the frequency band $(F_1, F_2)$.
• The relative power (RP) of the signal extracted in each frequency interval was derived from the formula $\mathrm{RP}_{i,1-2} = \frac{P_i(F_1, F_2)}{P_i(F_0, F_f)}$, where $P_i$ is the power value of audio sample $i$ obtained from the PSD, $F_0 = 1$ Hz, and $F_f = 4$ kHz. The wide band $[F_0, F_f]$ defines the low-pass filter we applied to all sound signals to eliminate noise.
• The average RP, $\mu$, was computed for all cough audio samples, for each relevant category of illness, and for the control group. The standard deviation, $\sigma$, was computed similarly.
The linear separability criterion, denoted $J$ [38], was then computed from $\mu_{Exp}$ and $\mu_{Ctr}$, the average RP for the relevant experimental and control groups, respectively, and from $\sigma_{Exp}$ and $\sigma_{Ctr}$, the standard deviations within the signals of the experimental and control groups, respectively.
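The relative-power and separability steps can be sketched as follows. This is a simplified illustration under stated assumptions: it sums the Welch PSD over the band rather than applying a Butterworth band-pass filter, the segment length `nperseg=1024` is a hypothetical choice, and the criterion in `separability` is an assumed Fisher-style form (squared difference of means over the sum of variances), which may differ from the exact expression used in the study.

```python
import numpy as np
from scipy.signal import welch


def relative_power(signal, fs, f1, f2, f0=1.0, ff=4000.0, nperseg=1024):
    """Power in [f1, f2] divided by power in the wide band [f0, ff]."""
    freqs, psd = welch(signal, fs=fs, nperseg=nperseg)
    band = (freqs >= f1) & (freqs <= f2)
    wide = (freqs >= f0) & (freqs <= ff)
    # Bins are evenly spaced, so the bin width cancels in the ratio.
    return psd[band].sum() / psd[wide].sum()


def separability(rp_exp, rp_ctr):
    """Assumed Fisher-style criterion, not necessarily the study's formula."""
    gap = np.mean(rp_exp) - np.mean(rp_ctr)
    return gap ** 2 / (np.var(rp_exp) + np.var(rp_ctr))


# Hypothetical signals: one dominated by 350 Hz, one by 1500 Hz.
fs = 22050
rng = np.random.default_rng(0)
t = np.arange(0, 0.5, 1 / fs)
low_tone = np.sin(2 * np.pi * 350 * t) + 0.1 * rng.standard_normal(t.size)
high_tone = np.sin(2 * np.pi * 1500 * t) + 0.1 * rng.standard_normal(t.size)

rp_low = relative_power(low_tone, fs, 300, 400)
rp_high = relative_power(high_tone, fs, 300, 400)
j_value = separability([0.8, 0.9], [0.1, 0.2])
```

Evaluating `separability` on a grid of candidate bands $(F_1, F_2)$ would give a surface of the kind described for the optimization over frequency ranges.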

Nearest neighbors classifier
KNN is a supervised learning technique that classifies data by finding the k closest data points in a feature space to a given test point, and then classifying the test point based on the majority class among its k nearest neighbors. Unlike LDA, KNN assumes no underlying data probability distribution or linearity of the decision boundary between classes. Analogous to the LDA algorithm, we applied the KNN algorithm to the RP values of both the pneumonia and asthma groups (without confounding COVID-19), as well as to the COVID-19-positive and control groups, to determine whether they were indeed separable within the frequency bands of interest.
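The majority-vote rule described above can be illustrated with a small NumPy sketch. The function name `knn_predict`, the choice k = 5, and the synthetic RP values are assumptions for the example; the study's actual implementation and hyperparameters are not specified here.

```python
import numpy as np


def knn_predict(train_x, train_y, test_x, k=5):
    """Classify each test point by majority vote among its k nearest neighbors."""
    train_x = np.asarray(train_x, dtype=float).reshape(len(train_x), -1)
    test_x = np.asarray(test_x, dtype=float).reshape(len(test_x), -1)
    train_y = np.asarray(train_y)
    predictions = []
    for point in test_x:
        distances = np.linalg.norm(train_x - point, axis=1)  # Euclidean
        nearest = train_y[np.argsort(distances)[:k]]
        labels, counts = np.unique(nearest, return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)


# Hypothetical RP values in one band: label 1 = experimental, 0 = control.
rng = np.random.default_rng(0)
rp_exp = rng.normal(0.6, 0.05, 40)
rp_ctr = rng.normal(0.3, 0.05, 40)
features = np.concatenate([rp_exp, rp_ctr])
labels = np.concatenate([np.ones(40, dtype=int), np.zeros(40, dtype=int)])

predicted = knn_predict(features, labels, [0.62, 0.28])
```

Because the vote depends only on distances, no distributional or linearity assumption enters the decision rule, which is the contrast with LDA drawn in the text.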

Results
Initially, we studied the PSD values, focusing on paired comparisons of COVID-19 versus the control group, pneumonia versus asthma, and COVID-19 versus both asthma and pneumonia. For clarity, Table 1 details the number of segmented cough audio samples used for the computations and the number of patients from the Coswara dataset reported to be affected by each of the relevant illnesses. The numbers of patients and samples differ because each sample is the isolated audio of a single cough event: in total, there were 1183 cough segments in the COVID-19-positive category versus 341 audio segments in the COVID-19-negative category. To mitigate this class imbalance, we used under-sampling.
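Random under-sampling of the majority class can be sketched as below. The helper `undersample` and the fixed seed are assumptions for illustration; only the class sizes (1183 and 341) come from the text.

```python
import numpy as np


def undersample(features, labels, seed=0):
    """Randomly keep as many samples per class as the smallest class has."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original sample order
    return features[keep], labels[keep]


# Class sizes from the text: 1183 positive and 341 negative cough segments.
labels = np.array([1] * 1183 + [0] * 341)
features = np.arange(labels.size)  # placeholder feature values
balanced_x, balanced_y = undersample(features, labels)
```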
A preliminary exploration of the PSD values, displayed in Fig. 1, revealed observable differences between the curves for each pair of categories. Figure 1a shows a comparison of the PSD values for the coughs of COVID-19-positive patients and those of the control group. The first noticeable difference occurred in the range of frequencies between 100 and 800 Hz, and the second at frequencies approaching 1500 Hz. These frequencies are in the low- and high-frequency ranges, respectively. The COVID-19-positive PSD curve revealed a peak at 250 Hz that was higher in amplitude than that of the COVID-19-negative curve, a characteristic that we observed in the COVID-19-positive curves of all three plots (see Fig. 1a-c). A second peak at around 1500 Hz, also distinguishable in all three plots, had half the amplitude of the control group's peak, as seen in Fig. 1a. These features allowed us to separate COVID-19 from other illnesses, and vice versa, within the two regions of frequencies corresponding to the previously mentioned low- and high-frequency ranges.
Comparing asthma and pneumonia without confounding COVID-19, we observed a difference between the two curves at the frequencies of 80-220 Hz (a very low frequency range) and 616-660 Hz. We also observed similar differences in the three peaks of patients with asthma at 1616 Hz, 2040 Hz, and 3450 Hz. The final peak at 3450 Hz in the PSD curves of the asthma samples was the lowest and was negligible for our analysis. The differences in the PSD curves were indicative of patterns in the cough sound signals (i.e., notable and distinguishable in COVID-19-positive patients) and potentially valuable for use with diagnostic machine learning classifiers. Comparing COVID-19-positive and COVID-19-negative asthma and pneumonia samples provided secondary support for related frequency bands capable of isolating COVID-19 from other respiratory illnesses. It also highlighted that the asthma and pneumonia curves exhibited differences in lower and higher frequencies, indicating distinct patterns in the cough sounds associated with these illnesses.
The linear separation value $J(F_1, F_2)$, estimated in the range 1-3000 Hz in Fig. 2, consistently showed two distinct regions with large J values, indicating higher separability. The first region we observed in Fig. 2a aligned with the PSD values highlighted previously, spanning the 1-1500 Hz frequency bands. The second was between 1000 and 2500 Hz. These regions are labeled Region 1 and Region 2 for the rest of our analysis. The three-dimensional plot of the linear separation value $J(F_1, F_2)$, calculated within the frequency range of 1-3000 Hz, is presented in Fig. 2b. The maximum J value is highlighted with a tick in the plot. In Fig. 2b, the maximum of the separability value is in the 1000-1600 Hz band, and the second maximum J value (i.e., the second peak) is located in the 200-800 Hz band. Both regions tally with the previous observations shown in Fig. 1. In summary, the representations of the J separability criterion of the cough sound signals' RP values shown in the plots in Fig. 2 provide a comprehensive view of the patterns and frequency bands of interest, support our preliminary observations of the PSD curves in Fig. 1, and could be valuable for developing diagnostic algorithms.
Table 1. Number of patients in dataset categories composition. The patients and samples differed in that the samples consisted of isolated audio recordings of cough events. Each patient's audio file contained several such events.
The next step involved assessing the p values resulting from the Mann-Whitney U test [39]. The t-test can be less reliable with unequal sample sizes, whereas the Mann-Whitney U test does not have this limitation [40]. In addition, the non-parametric nature of the Mann-Whitney U test makes the analysis more generalizable [41]. Table 2 presents the results of the statistical test's heuristic search and consequently only includes the minimum values returned by the U test. The frequency bands featured are encompassed by the same frequency regions as described in Fig. 1 (around 300-800 Hz and 1100-1850 Hz for the RP distributions of the COVID-19-positive versus COVID-19-negative control group, and around 100-800 Hz and 1400-2100 Hz for the frequency bands of the asthma and pneumonia categories without confounding COVID-19). The p values ranged from $6.53 \times 10^{-4}$ to 0.0454 and were below 0.05 for all the frequency bins we computed (i.e., 300-400 Hz and 1100-1650 Hz for the COVID-19-positive group versus the COVID-19-negative group, with p values equal to $6.53 \times 10^{-4}$ and $4.58 \times 10^{-4}$, respectively). The first optimal frequency band was within Region 1, and the second was within Region 2. The optimal frequency band was 1400-1600 Hz for the computation of the asthma versus pneumonia p values, with a value of $1.39 \times 10^{-3}$. For comparison, the COVID-19-positive versus COVID-19-negative and the asthma versus pneumonia p values for the RP in the frequency band of 1-3000 Hz were 0.49 and 0.66, respectively.
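A band search of this kind can be sketched with SciPy's `mannwhitneyu`. The helper `best_band`, the two candidate bands, and the synthetic RP distributions are assumptions for illustration only; they do not reproduce the study's data or its exact search procedure.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def best_band(rp_by_band):
    """Map each band to its two-sided U-test p value; return the minimum."""
    p_values = {
        band: mannwhitneyu(rp_exp, rp_ctr, alternative="two-sided").pvalue
        for band, (rp_exp, rp_ctr) in rp_by_band.items()
    }
    return min(p_values, key=p_values.get), p_values


# Hypothetical RP samples: groups differ in one band, not in the wide band.
rng = np.random.default_rng(0)
rp_by_band = {
    (300, 400): (rng.normal(0.6, 0.05, 30), rng.normal(0.3, 0.05, 30)),
    (1, 3000): (rng.normal(0.5, 0.05, 30), rng.normal(0.5, 0.05, 30)),
}

band, p_values = best_band(rp_by_band)
```

Because the U test is rank-based, it makes no normality assumption about the RP values, which is the property the text relies on when group sizes differ.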
Both the low and high p values presented herein are consistent with our preliminary observations and showed significant differences in the RP values between the COVID-19-positive and COVID-19-negative control groups. They also supported the existence of frequency bands capable of separating each of the categories, in line with our preliminary observations and demonstrating the presence of an acoustical signature for the medical conditions causing coughs. We devoted the final part of the study to classifying the RP values of the optimal frequency bands, as discussed previously, using LDA and KNN machine learning classifiers. It is crucial to highlight that the availability of samples for COVID-19-positive and COVID-19-negative cases significantly influenced our data selection strategy. To ensure the integrity and impartiality of our analysis, we deliberately opted for the random selection of a representative subset of 341 samples for each class, thereby mitigating potential biases in our findings. The pneumonia and asthma categories from the Coswara dataset were smaller and formed a balanced dataset of 48 asthma samples and 50 pneumonia samples.
The performance characteristics of the LDA and KNN classifiers are displayed in Table 3 using the area under the curve (AUC), accuracy, and Matthews correlation coefficient (MCC) metrics. While a receiver operating characteristic curve plots the true positive rate against the false positive rate, the AUC is a measure of a classifier's overall performance. Likewise, accuracy is defined as the proportion of correctly classified samples. The MCC, however, ranges from −1 to +1, with a value of +1 representing a perfect prediction, 0 indicating a random prediction, and −1 indicating complete disagreement between the predicted and true labels. For the COVID-19-positive versus COVID-19-negative LDA results, the best accuracy and AUC were achieved using the RP values of the respective frequency bands: 1100-1650 Hz and 300-800 Hz. The first frequency band, 1100-1650 Hz, corresponded to the optimal band identified previously, with the lowest Mann-Whitney U test p value. The second frequency band of interest, which was identified as an optimal band at 300-400 Hz, had the lowest accuracy and AUC of all the values, but they were still significant at 0.80 and 0.71, respectively. Similarly, the highest accuracy and AUC performance values for the classification of asthma versus pneumonia were achieved using the RP of the 50-900 Hz and 450-1000 Hz frequency bands, respectively. Again, the optimal frequency band identified in the Mann-Whitney U test differed from that for the highest LDA scores. However, the optimal band at 1400-1650 Hz exhibited sufficiently high levels of accuracy and AUC (0.89 and 0.90, respectively), indicating the ability of the LDA classifier to adequately distinguish between the pneumonia and asthma samples based on their respective RP values. For both the COVID-19-positive versus COVID-19-negative and pneumonia versus asthma samples, the LDA classifier did not reach optimal performance consistently for the previously identified optimal frequency bands based on the RP values. Nevertheless, it achieved adequate and comparable performance values for all the optimal bands, allowing us to deduce arguments similar to those for the KNN classifier performance values.
With the KNN classifier, unlike with the LDA classifier, for the COVID-19-positive group versus the control group, the 1100-1850 Hz band outperformed the optimal band within that region (i.e., 1100-1650 Hz), and the 300-400 Hz band had the highest overall performance, with an MCC value of 0.71. For the pneumonia versus asthma KNN classification, the low-frequency 50-900 Hz range yielded the highest MCC performance values, although both frequency bands in Region 1 had similar performance metric values. In short, the KNN classifier performed almost identically to the LDA classifier in the second region, with high metric values for the 1400-1600 Hz frequency band. Notably, this analysis was conducted using both linear (LDA) and non-linear (KNN) machine learning classifiers. Altogether, these results corroborated our preliminary observations, indicating the existence of a set of desirable and optimal frequency bands in cough sounds that, using machine learning classifiers, we can leverage to detect and thus diagnose a cough's underlying ailment.
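The three reported metrics can be computed for a binary problem as in the sketch below. The function names and the toy labels and scores are assumptions for illustration; the AUC is obtained here through its rank-sum equivalence rather than by tracing the ROC curve explicitly.

```python
import numpy as np
from scipy.stats import rankdata


def accuracy(y_true, y_pred):
    """Proportion of correctly classified samples."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(y_true, scores):
    """Area under the ROC curve via the rank-sum identity (ties averaged)."""
    y_true = np.asarray(y_true)
    ranks = rankdata(scores)
    n_pos = int(np.sum(y_true == 1))
    n_neg = y_true.size - n_pos
    rank_sum = ranks[y_true == 1].sum()
    return float((rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))


# Toy example with hypothetical labels, predictions, and classifier scores.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]

acc_value = accuracy(y_true, y_pred)
mcc_value = mcc(y_true, y_pred)
auc_value = auc(y_true, scores)
```

In this toy case the classifier gets four of six labels right, so accuracy is 2/3 while the MCC is only 1/3, which illustrates why the MCC is a stricter summary than accuracy.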

Discussion
The application of the Mann-Whitney U test to our study data revealed statistically significant differences in the RP of cough sound signals between COVID-19-positive individuals and a COVID-19-negative control group, particularly in the 300-400 Hz and 1100-1650 Hz frequency bands. The notably low p values in these tests strongly suggest that these differences are not random occurrences, thereby supporting our hypothesis that the RP distributions in cough sound signals differ between the two groups. These results underscore the potential of RP in cough sound signals as a significant biomarker for COVID-19, especially within these specific frequency bands. Further analysis using LDA for classification based on the RP of cough signals indicated a high level of accuracy in differentiating COVID-19-positive from COVID-19-negative patients across various frequency bands. Notably, the frequency band of 1100-1850 Hz demonstrated the highest classification accuracy, reaching a value of 0.88. Additionally, comparisons of RP values for pneumonia versus asthma across all relevant frequency bands showed substantial accuracy, exceeding 0.78. These findings reinforce the premise that the RP of cough signals is a reliable indicator for distinguishing not only COVID-19 status but also other respiratory conditions, such as pneumonia and asthma, within these frequency ranges.
It is important to acknowledge that this study relied on crowdsourced data rather than clinical data. The cough recordings were self-reported by patients, which introduces a degree of uncertainty regarding their accuracy. Moreover, the control group was relatively small and lacked diversity, with a significant proportion of the dataset comprising male participants aged between 20 and 35 years, accounting for approximately 70% of the total sample. This demographic skew poses limitations on the study's statistical significance and generalizability, indicating a need for more diverse and representative data in future research.
Additionally, the statistical methodologies employed in this study come with inherent limitations. For instance, the Welch method used for calculating the PSD provides only an approximation. These methodological constraints suggest that our findings should be considered preliminary, necessitating further research and validation.
www.nature.com/scientificreports/
Our study contributes novel insights into the field of respiratory disease diagnosis using cough sound signals, particularly for COVID-19, augmenting the existing body of literature that predominantly focuses on deep learning models like convolutional neural networks. Unlike previous studies, which have reported varied model performance accuracy for COVID-19 prognosis, our approach employs classical machine learning classifiers and uniquely identifies specific frequency bands within the power density spectrum of cough signals for disease diagnosis. This methodological divergence not only confirms the potential of cough as a biomarker but also adds a new dimension to its analysis in the context of respiratory illnesses.

Conclusion
This study highlights distinct physical and acoustical differences in cough sound signals linked to various respiratory diseases, with a focus on COVID-19. By comparing these signals with those from a control group of patients suffering from other respiratory ailments, we aimed to establish a benchmark for COVID-19 cough characteristics. While the limited size and diversity of the control group posed challenges, our findings significantly contribute to the understanding of cough sounds as biomarkers for respiratory conditions. We successfully identified specific low- and high-frequency bands associated with different diseases, demonstrating the potential of cough acoustics in diagnosing conditions like COVID-19, asthma, and pneumonia. This research represents an initial step towards a comprehensive understanding of cough sounds and their diagnostic potential.

Figure 1. Power spectral density (PSD) computed on each audio recording using the Welch method and averaged over all population cough signals extracted from the crowdsourced, publicly available Coswara dataset [29]. The dataset's recordings were collected from participants using their built-in computer microphones; the sampling rate of the data is 22,050 Hz. (a) Comparison of PSD between cough signals of COVID-19-positive and COVID-19-negative patients. Zoom-in view between 1 and 4 kHz. (b) Comparison of PSD between cough signals of pneumonia COVID-19-positive and COVID-19-negative patients. Zoom-in view between 1 and 4 kHz. (c) Comparison of PSD between cough signals of asthma COVID-19-positive and COVID-19-negative patients. Zoom-in view between 1 and 4 kHz. (d) Comparison of PSD between cough signals of patients with asthma and pneumonia. Zoom-in view between 1 and 4 kHz.

Figure 2. 2-D and 3-D visual representations of the linear separation value J optimization process for distinguishing between COVID-19-positive and COVID-19-negative patients using cough signals extracted from the publicly available Coswara dataset [29]. (a) 2-D visualization of the optimized separation value over different frequency bands, demonstrating the clear differentiation between COVID-19-positive and COVID-19-negative patients. The white dotted lines on the 2-D plot indicate the boundaries of distinct regions: Region 1 (left) and Region 2 (right). (b) 3-D visualization of the optimization process, providing a more comprehensive view of the separation between the two groups and highlighting the potential of utilizing cough signals as a diagnostic tool for COVID-19. The dataset's recordings were collected from participants using their built-in computer microphones; the sampling rate of the data is 22,050 Hz.

Table 2. p values for the COVID-19-positive group versus the COVID-19-negative control group and the pneumonia group versus the asthma (COVID-19-negative) group. We computed p values using a Mann-Whitney U test to analyze the RP of signals within frequency bands of interest.

Table 3. LDA and KNN classification performance results for the COVID-19-positive group versus the COVID-19-negative control group, and the pneumonia group versus the asthma group. Performance metrics were accuracy, MCC, and AUC, and the classifier computations were based on the RP of signals within the frequency bands of interest. LDA refers to linear discriminant analysis, KNN refers to k-nearest neighbors, MCC refers to Matthews correlation coefficient, and AUC refers to area under the ROC curve.